Introduction
In this report we analyze a dataset obtained from the TMDb (The Movie Database) API. The dataset contains information about movies, including details such as budget, overview, and revenue. The main purpose of this report is to look for trends in the film industry and to identify the reasons behind a movie's success.
Dataset Description
1: Dataset Source
* The data was obtained from the TMDb API.
2: Dataset Size
* The data contains 10866 rows and 21 columns.
3: Dataset Concerns
* The dataset does not contain every movie released in a given year.
* Some values are missing from the dataset.
Questions for Analysis
Is there a relationship between movie budgets and revenues across different genres?
Understanding the relationship between budgets and revenues can provide insights into the financial aspects of movie production. By analyzing this relationship across genres, we can determine if certain genres are more financially successful than others, and if budgeting strategies should vary based on genre.
How has the popularity of different genres changed over the years?
Examining popularity trends can reveal changes in audience preferences over time. This analysis can help filmmakers and studios understand which genres are currently trending and adjust their production strategies accordingly to cater to audience interests.
How has the average runtime of movies changed over the years?
Changes in movie runtime can reflect shifts in storytelling styles, audience attention spans, and production trends. Analyzing this data can provide insights into evolving cinematic practices and audience expectations regarding movie length.
Data Wrangling
import pandas as pd

def wrangling(file_path):
    """Load the CSV at file_path, print a quick overview, and return the DataFrame."""
    # Load the dataset
    df = pd.read_csv(file_path)
    # Display the first few rows of the dataset
    print(df.head())
    # Display the info of the dataset to understand its structure.
    # info() prints its report itself and returns None, so print() also emits "None".
    print(df.info())
    # Display summary statistics of the numeric columns
    print(df.describe())
    # Count missing values per column
    print(df.isnull().sum())
    return df

# Call the function with the file path
df = wrangling('Database_TMDb_movie_data/tmdb-movies.csv')
id imdb_id popularity budget revenue \
0 135397 tt0369610 32.985763 150000000 1513528810
1 76341 tt1392190 28.419936 150000000 378436354
2 262500 tt2908446 13.112507 110000000 295238201
3 140607 tt2488496 11.173104 200000000 2068178225
4 168259 tt2820852 9.335014 190000000 1506249360
original_title \
0 Jurassic World
1 Mad Max: Fury Road
2 Insurgent
3 Star Wars: The Force Awakens
4 Furious 7
cast \
0 Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...
1 Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...
2 Shailene Woodley|Theo James|Kate Winslet|Ansel...
3 Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...
4 Vin Diesel|Paul Walker|Jason Statham|Michelle ...
homepage director \
0 http://www.jurassicworld.com/ Colin Trevorrow
1 http://www.madmaxmovie.com/ George Miller
2 http://www.thedivergentseries.movie/#insurgent Robert Schwentke
3 http://www.starwars.com/films/star-wars-episod... J.J. Abrams
4 http://www.furious7.com/ James Wan
tagline ... \
0 The park is open. ...
1 What a Lovely Day. ...
2 One Choice Can Destroy You ...
3 Every generation has a story. ...
4 Vengeance Hits Home ...
overview runtime \
0 Twenty-two years after the events of Jurassic ... 124
1 An apocalyptic story set in the furthest reach... 120
2 Beatrice Prior must confront her inner demons ... 119
3 Thirty years after defeating the Galactic Empi... 136
4 Deckard Shaw seeks revenge against Dominic Tor... 137
genres \
0 Action|Adventure|Science Fiction|Thriller
1 Action|Adventure|Science Fiction|Thriller
2 Adventure|Science Fiction|Thriller
3 Action|Adventure|Science Fiction|Fantasy
4 Action|Crime|Thriller
production_companies release_date vote_count \
0 Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562
1 Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185
2 Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480
3 Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292
4 Universal Pictures|Original Film|Media Rights ... 4/1/15 2947
vote_average release_year budget_adj revenue_adj
0 6.5 2015 1.379999e+08 1.392446e+09
1 7.1 2015 1.379999e+08 3.481613e+08
2 6.3 2015 1.012000e+08 2.716190e+08
3 7.5 2015 1.839999e+08 1.902723e+09
4 7.3 2015 1.747999e+08 1.385749e+09
[5 rows x 21 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   id                    10866 non-null  int64
 1   imdb_id               10856 non-null  object
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64
 4   revenue               10866 non-null  int64
 5   original_title        10866 non-null  object
 6   cast                  10790 non-null  object
 7   homepage              2936 non-null   object
 8   director              10822 non-null  object
 9   tagline               8042 non-null   object
 10  keywords              9373 non-null   object
 11  overview              10862 non-null  object
 12  runtime               10866 non-null  int64
 13  genres                10843 non-null  object
 14  production_companies  9836 non-null   object
 15  release_date          10866 non-null  object
 16  vote_count            10866 non-null  int64
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
None
id popularity budget revenue runtime \
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000
vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64
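The summary statistics above report a median of 0 for budget and revenue and a minimum runtime of 0, even though the null counts for those columns are 0. This suggests that zeros stand in for unreported values. The sketch below shows one way such placeholders could be counted and masked. It assumes that zero really means "unknown", which the dataset itself does not document; the toy rows and the helper name `mask_zero_placeholders` are hypothetical.

```python
import numpy as np
import pandas as pd

def mask_zero_placeholders(df, columns=("budget", "revenue", "runtime")):
    """Return a copy of df where zeros in the given columns become NaN."""
    out = df.copy()
    for col in columns:
        # Assumption: a value of 0 means "not reported", not a real amount.
        out[col] = out[col].astype("float64").replace(0, np.nan)
    return out

# Hypothetical toy rows standing in for the real dataset.
toy = pd.DataFrame({
    "budget": [150_000_000, 0, 0],
    "revenue": [1_513_528_810, 0, 24_000_000],
    "runtime": [124, 0, 99],
})

# Count the zero placeholders per column before masking them.
zero_counts = (toy[["budget", "revenue", "runtime"]] == 0).sum()
masked = mask_zero_placeholders(toy)

print(zero_counts)
print(masked.isnull().sum())
```

Masking instead of dropping keeps the rows available for questions that do not depend on budget or revenue, such as the popularity and runtime trends.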
Data Cleaning
Now that we have collected some information about the dataset, it is time to clean it:
def clean_data(df):
    """Return a cleaned copy of df without modifying the original."""
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    # Fill missing values in descriptive text columns
    df['homepage'] = df['homepage'].fillna('Not available')
    df['tagline'] = df['tagline'].fillna('')
    df['keywords'] = df['keywords'].fillna('')
    df['overview'] = df['overview'].fillna('')
    # Drop rows with missing values in critical columns, then reset the index
    df = df.dropna(subset=['genres', 'production_companies', 'director', 'cast', 'imdb_id'])
    df = df.reset_index(drop=True)
    return df

# Clean the data
df_cleaned = clean_data(df)
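The notebook does not report how many rows the cleaning step removes. The following is a small sketch of how that could be checked, using a hypothetical helper `summarize_cleaning` and toy data in place of the real frames. The duplicate-row count is an extra check that the original analysis does not perform.

```python
import pandas as pd

def summarize_cleaning(before, after, critical):
    """Report how many rows cleaning removed and whether critical columns are complete."""
    return {
        "rows_before": len(before),
        "rows_after": len(after),
        "rows_dropped": len(before) - len(after),
        "nulls_left": int(after[list(critical)].isnull().sum().sum()),
        "duplicate_rows": int(after.duplicated().sum()),
    }

# Hypothetical toy data: one row lacks a genre, two rows are identical.
critical = ["genres", "director"]
toy_before = pd.DataFrame({
    "genres": ["Action", None, "Drama", "Drama"],
    "director": ["A", "B", "C", "C"],
})
toy_after = toy_before.dropna(subset=critical).reset_index(drop=True)

report = summarize_cleaning(toy_before, toy_after, critical)
print(report)
```

On the real data the same call would take `df`, `df_cleaned`, and the list of critical columns passed to `dropna`.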
Genre Distribution
The bar chart below shows the distribution of movies by genre. We can observe which genres are most prevalent in the dataset, providing an understanding of the genre composition of the movies we are analyzing.
Note that the chart below is hard to read: each value in the genres column is a pipe-separated combination of genres, so the number of distinct categories is very large. The rest of this report tries to clarify these observations.
import matplotlib.pyplot as plt
import seaborn as sns
# Count the occurrences of each genre
genre_counts = df_cleaned['genres'].value_counts().reset_index()
genre_counts.columns = ['genres', 'count']
# Plot the distribution of movie genres
plt.figure(figsize=(14, 8))
sns.barplot(data=genre_counts, x='genres', y='count', palette='viridis')
plt.xlabel('Genres')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Genres')
# Rotate x-axis labels
plt.xticks(rotation=45, ha='right')
plt.tight_layout(pad=3.0)
plt.show()
Popularity Distribution
The histogram below shows the distribution of popularity scores across all movies in the dataset. This gives us an idea of how popularity is spread among the movies, highlighting any skewness or outliers in the data.
plt.figure(figsize=(12, 6))
sns.histplot(df['popularity'], bins=30, kde=True)
plt.xlabel('Popularity')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Popularity')
plt.tight_layout()
plt.show()
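The skewness mentioned above can also be expressed as a number rather than read off the histogram. The sketch below computes the sample skewness before and after a log transform, on hypothetical toy values rather than the real popularity column. The choice of `log1p` is an assumption, since the original analysis does not transform the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values: mostly small scores with a long right tail.
popularity = pd.Series(
    [0.1, 0.2, 0.3, 0.4, 0.6, 0.9, 1.5, 3.0, 9.0, 33.0], name="popularity"
)

# Positive skewness means a long right tail.
raw_skew = popularity.skew()

# log1p compresses large values, which usually reduces right skew.
log_popularity = np.log1p(popularity)
log_skew = log_popularity.skew()

print(f"skewness of raw scores:    {raw_skew:.2f}")
print(f"skewness after log1p:      {log_skew:.2f}")
```

If the real column behaves like the toy values, plotting `log_popularity` instead of the raw scores would spread out the bulk of the distribution that the histogram squeezes into its first few bins.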
The genre distribution bar chart shown earlier indicates that genres such as Drama, Thriller, Action, and Comedy are the most common in the dataset. This suggests that these genres are popular choices for movie production.
# Group by genre and calculate the average revenue and budget
genre_stats = df_cleaned.groupby('genres')[['revenue', 'budget']].mean().reset_index()
# Plot the correlation
plt.figure(figsize=(14, 8))
sns.scatterplot(data=genre_stats, x='budget', y='revenue', hue='genres', palette='tab20', s=100)
plt.xlabel('Average Budget')
plt.ylabel('Average Revenue')
plt.title('Average Revenue vs. Average Budget by Genre')
# plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
# plt.tight_layout()
plt.show()
The previous plot shows how the average budget and average revenue differ by genre.
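Comparing per-genre averages shows how genres differ, but not how strongly budget and revenue move together within a genre, which is what the first analysis question asks. The sketch below computes a Pearson correlation per genre. The helper `budget_revenue_correlation`, the `min_movies` threshold, and the toy rows are all hypothetical, and on the real data it would need a frame with a single genre per row.

```python
import pandas as pd

def budget_revenue_correlation(df, group_col="genre", min_movies=3):
    """Pearson correlation between budget and revenue within each group."""
    rows = []
    for name, group in df.groupby(group_col):
        if len(group) < min_movies:
            continue  # too few movies for a meaningful correlation
        rows.append({
            group_col: name,
            "n_movies": len(group),
            "corr": group["budget"].corr(group["revenue"]),
        })
    return pd.DataFrame(rows)

# Hypothetical toy data, one row per movie and genre.
toy = pd.DataFrame({
    "genre": ["Action"] * 3 + ["Horror"] * 3 + ["Drama"] * 2,
    "budget": [10, 20, 30, 1, 2, 3, 5, 6],
    "revenue": [20, 40, 60, 9, 5, 1, 7, 8],
})

correlations = budget_revenue_correlation(toy)
print(correlations)
```

A correlation close to 1 would mean that bigger budgets go with bigger revenues inside that genre; it says nothing about whether the spending causes the revenue.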
def calculate_genre_stats(df_split):
    # df_split is expected to hold one row per movie and genre, in a 'genre' column.
    # Returns the average popularity per genre and release year.
    return df_split.groupby(['genre', 'release_year'])['popularity'].mean().reset_index()

def calculate_genre_stats3(df_split):
    # Returns the average revenue and budget per genre
    return df_split.groupby('genre').agg({'revenue': 'mean', 'budget': 'mean'}).reset_index()

def plot_genre_popularity(genre_popularity):
    # Line plot of average popularity per genre over the release years
    plt.figure(figsize=(16, 8))
    sns.lineplot(data=genre_popularity, x='release_year', y='popularity', hue='genre', palette='tab20', linewidth=2.5)
    plt.xlabel('Release Year')
    plt.ylabel('Average Popularity')
    plt.title('Average Popularity of Different Genres Over the Years')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()
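The helpers above take a `df_split` argument with a `genre` column, but the notebook never shows how that frame is built. One plausible way, sketched here under the assumption that each movie should count once for every genre it lists, is to split the pipe-separated `genres` string and explode it into one row per genre. The helper name `split_genres` is hypothetical; the toy rows reuse values from the `head()` output.

```python
import pandas as pd

def split_genres(df, source_col="genres", target_col="genre"):
    """Return one row per (movie, genre) pair from a pipe-separated column."""
    out = df.dropna(subset=[source_col]).copy()
    # "Action|Crime|Thriller" -> ["Action", "Crime", "Thriller"]
    out[target_col] = out[source_col].str.split("|")
    # explode() repeats the other columns once for every list element
    return out.explode(target_col).reset_index(drop=True)

toy = pd.DataFrame({
    "original_title": ["Jurassic World", "Furious 7"],
    "genres": ["Action|Adventure|Science Fiction|Thriller", "Action|Crime|Thriller"],
    "release_year": [2015, 2015],
    "popularity": [32.985763, 9.335014],
})

df_split = split_genres(toy)

# Same aggregation as calculate_genre_stats, applied to the exploded frame.
genre_popularity = (
    df_split.groupby(["genre", "release_year"])["popularity"].mean().reset_index()
)
print(genre_popularity)
```

With a frame like this, the charts would group by a handful of individual genres instead of thousands of genre combinations. Note that a movie with several genres then contributes to several groups, so group counts no longer add up to the number of movies.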
# Group by genre and release year, and calculate the average popularity
genre_popularity = df_cleaned.groupby(['genres', 'release_year'])['popularity'].mean().reset_index()
# Plotting the average popularity of each genre over the years
plt.figure(figsize=(14, 8))
sns.lineplot(data=genre_popularity, x='release_year', y='popularity', hue='genres', palette='tab20', marker='o')
plt.xlabel('Release Year')
plt.ylabel('Average Popularity')
plt.title('Average Popularity of Each Genre Over the Years')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Genres')
plt.tight_layout(pad=3.0)
plt.show()
/tmp/ipykernel_13/3067001640.py:15: UserWarning: Tight layout not applied. The bottom and top margins cannot be made large enough to accommodate all axes decorations. plt.tight_layout(pad=3.0)
This plot shows how the popularity of different genres has changed over the years.
The line plot of average popularity over the years reveals trends in how different genres have gained or lost popularity. For instance, genres such as Action and Adventure have maintained high popularity, while others have fluctuated.
These trends are consistent with the previous plots, which makes the results more convincing.
# Calculate the average runtime for each release year
avg_runtime = df_cleaned.groupby('release_year')['runtime'].mean()
# Plot the average runtime
plt.figure(figsize=(12, 6))
plt.bar(avg_runtime.index, avg_runtime.values, color='skyblue')
plt.xlabel('Release Year')
plt.ylabel('Average Runtime')
plt.title('Average Runtime Over Years')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
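The previous plot shows how the average runtime has changed from year to year: average runtimes were higher in earlier decades and have gradually declined since.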
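The summary statistics earlier report a minimum runtime of 0 and a maximum of 900 minutes, and values like these can distort a yearly mean. The sketch below shows a more robust yearly summary. It assumes that a runtime of 0 is a placeholder for a missing value; the helper `yearly_runtime` and the toy rows are hypothetical.

```python
import pandas as pd

def yearly_runtime(df, min_runtime=1):
    """Mean, median, and movie count of runtime per release year, ignoring placeholders."""
    # Assumption: runtimes below min_runtime (i.e. 0) are missing values.
    valid = df[df["runtime"] >= min_runtime]
    return valid.groupby("release_year")["runtime"].agg(["mean", "median", "count"])

# Hypothetical toy data with one zero and one extreme runtime.
toy = pd.DataFrame({
    "release_year": [1960, 1960, 1960, 2015, 2015, 2015],
    "runtime": [120, 0, 130, 90, 100, 900],
})

runtime_by_year = yearly_runtime(toy)
print(runtime_by_year)
```

In the toy data the single 900-minute entry pulls the 2015 mean far above the median, which is why plotting the median next to the mean can make a trend claim easier to trust.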
Conclusions
In this analysis of the TMDb movie dataset, we explored the relationship between movie genres and various aspects such as popularity, revenue, and budget. Here are the key findings:
Genres and Popularity
We observed that certain genres, such as Action, Adventure, and Science Fiction, tended to be more popular among audiences over the years. These genres consistently scored high on the popularity measure.
Common Genres
The most common genres in the dataset were Drama, Comedy, Thriller, and Action. These genres were prevalent across a wide range of movies in the dataset.
Genres and Revenue/Budget
While there was some variation, we found that genres such as Adventure, Fantasy, and Animation tended to have higher average revenues and budgets compared to other genres. This suggests a potential correlation between these genres and financial success.
Overall Insights
Overall, our analysis suggests that genre is associated with the success of a movie, both in terms of audience engagement and financial performance. Understanding these trends can help filmmakers and studios make informed decisions about the types of movies to produce.
Limitations
It's important to note that our analysis has some limitations. The dataset may not be fully representative of all movies released, and there may be other factors beyond genre that influence a movie's success.
Relationship Between Movie Budgets and Revenues Across Different Genres
The analysis revealed varying relationships between budgets and revenues across different genres. While some genres, such as Adventure and Animation, tend to have higher average revenues, they also require larger budgets. On the other hand, genres like Documentary and Horror show lower budget requirements but can still achieve significant revenues. This suggests that budgeting strategies should be tailored to the specific genre to maximize financial success.
Popularity Trends of Different Genres Over the Years
The popularity trends of genres have fluctuated over the years, indicating shifting audience preferences. Certain genres, such as Adventure and Science Fiction, have shown consistent popularity, while others, like Western and War, have experienced declines. These trends highlight the importance of adapting to changing audience tastes and producing content that aligns with current trends to maintain audience engagement.
Changes in Average Runtime Over the Years
The analysis of average movie runtimes over the years revealed a gradual decline in recent decades. This trend suggests a shift towards shorter movie lengths, possibly driven by changes in audience preferences and viewing habits. Filmmakers may need to consider these trends when planning the length of their movies to cater to modern audience expectations.
Future Work
Future work could explore additional factors that contribute to a movie's success, such as the impact of specific actors or directors, as well as regional variations in genre preferences.
# Running this cell will execute a bash command to convert this notebook to an .html file
!python -m nbconvert --to html Investigate_a_Dataset.ipynb
[NbConvertApp] Converting notebook Investigate_a_Dataset.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 4 image(s).
[NbConvertApp] Writing 10220760 bytes to Investigate_a_Dataset.html